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Abstract 



The aim of this paper is three-fold. First we show that Hessian-Free ( jMartens 
|2010| l and Krylov Subspace Descent ( |Vinyals and Povey} 2012| l can be describee 
as implementations of natural gradient descent due to their use of the extended 
Gauss-Newton approximation of the Hessian. Secondly we re-derive natural gra- 
dient from basic principles, contrasting the difference between two versions of 
the algorithm found in the neural network literature, as well as highlighting a few 
differences between natural gradient and typical second order methods. Lastly we 
show empirically that natural gradient can be robust to overfitting and particularly 
it can be robust to the order in which the training data is presented to the model. 



1 Introduction 



Several recent papers tried to address the issue of using better optimization techniques for machine 
learning, especially for training deep architectures or neural networks of various kinds. Hessian-Free 
optimization (Martens, 2010; Suts kever et al.\ [201 1| |Chapelle and Erhan| |201 l|l, Krylov Subspace 



Descent (|Vinyals and Povey| |2012|l, na tural gradient descent ( |Amari||1997||Park et al.\ 2000 Le 



Roux et al. 



usually can 



2008 



be sp 



Le Roux et al. 201 l| l are just a few of such recently proposed algorithms. They 



it in two different categories: those which make use of second order information 



and those which use the geometry of the underlying parameter manifold (natural gradient). 

One particularly interesting pipeline to scale up such algorithms was originally proposed in |Pearl- 



mutter ( 1994), finet uned in|Schraud olph (2001) and represents the backbone behind both Hessian- 
Free optimization ( Martens||2010| l andKrylov Subspace Descent ( Vinyals and Povey |2012|l. The 
core idea behind it is to make use of the forward (renamed to R-operator in |Pearlmutter| ( |1994| l) and 
backward pass of automatic differentiation to compute efficient products between Jacobian or Hes- 
sian matrices and vectors. These products are used within a truncated-Newton approach (Nocedal 
|and W right 2000) which considers the exact Hessian and only inverts it approximately without the 
need for explicitly storing the matrix in memory, as opposed to other approaches which perform a 
more crude approximation of the Hessian (or Fisher) matrix (either diagonal or block-diagonal). 

The contributions of this paper to the study of the natural gradient are as follows. We provide a de- 
tailed derivation of the natural gradient, avoiding elements of information geometry. We distinguish 
natural gradient descent from TONGA and provide arguments suggesting that natural gradient may 
also benefits from a form of robustness that should yield better generalization. The arguments for 
this robustness are different from those invoked for TONGA. We show experimentally the effects 
of this robustness when we increase the accuracy of the metric using extra unlabeled data. We also 
provide evidence that the natural gradient is robust to the order of training examples, resulting in 
lower variance as we change the order. The final contribution of the paper is to show that Martens' 
Hessian-Free approach of Martens ( 2010| > (and implicitly Krylov Subspace Descent (KSD) algo- 
rithm) can be cast into the framework of the natural gradient, showing how these methods can be 
seen as doing natural gradient rather then second order optimization. 
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2 Natural Gradient 



Natural gradient can be traced back to Amari's work on information geometry (|AmarT| [T985J and 
its application to various neural netw orks (|Amari et al.\ 1992||Amari||1997|l, t hough a more in depth 
introduction can be found in |Amari| ( fl998| l; |Park et a /.| ( |2000| l; | Arnold et al\\20\ \\. T he algorithm 
has also been successfully applied in the reinforcement learnin g community (fK akade 2001 1 [Peters | 
|and Sc haal 2008) and for stochastic search (Su n et aT\\2009\ . |Le Roux~e f al. (2007) introduces a 
different formulation of the algorithm for deep models. Although similar in name, the algorithm is 



motivated differently and is not equivalent to Amari's version, as will be shown in section |47T| 

Let us consider a family of density functions T : R p — > (B — > [0, 1]), where for every 9 £ K p , 
J 7 (9) defines a density function from B — > [0, 1] over the random variable z £ B, where B is some 
suitable numeric set of values, for e.g. B = Mr . We also define a loss function that we want to 
minimize C : E p — > K. Any choice of 9 £ M. p defines a particular density function pe(z) = F{9) 
and by considering all possible 9 values, we explore the set F, which is our functional manifold. 
Because we can define a similarity measures between nearby density functions, given by the KL- 
divergence which in its infinitesimal form behaves like a distance measure, we are dealing with a 
Riemannian manifold whose metric is given by the Fisher Information matrix. Natural gradient 
attempts to move along the manifold by correcting the gradient of C according to the local curvature 
of the KL-divergence surface Q 
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We can derive this result without relying on information geometry. We consider the natural gradient 
to be defined as the algorithm which, at each step, picks a descent direction such that the KL- 
divergence between pe and pe+Ae is constant. At each step, we need to find A9 such that: 



arg minAs C(6 + A9) 

s. t. KL(pq\ \pe+Ae) = constant 



(2) 



Using this constraint we ensure that we move along the functional manifold with constant speed, 
without being slowed down by its curvature. This also makes learning robust to re-parametrizations 
of the model, as the functional behaviour of p does not depend on how it is parametrized. 

Assuming A9 — > 0, we can approximate the KL divergence by its Taylor series: 
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The first term cancels out and because E, 
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2 log] 
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— 0, we are left with only the last term. 



d log p s (z) 

de 

The Fisher Information Matrix form can be obtain from the expected value of the Hessian through 
algebraic manipulations (see the Appendix). 

We now express equation |2]l as a Lagrangian, where the KL divergence is approximated by Q and 
C(9 + A9) by its first order Taylor series C{9) + ^p-A9: 



Throughout this paper we use the mathematical convention that a partial derivative 
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the continuous case as well, replacing sums for integrals. 
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Solving equation |5]l for AO gives us the natural gradient formula ([TJ. Note that we get a scalar 
factor of 2^ times the natural gradient. We fold this scalar into the learning rate, and hence the 
learning rate also controls the difference between po and pe+Ae that we impose at each step. Also 
the approximations we make are meaningful only around 0. Schaul (20121 suggests that using a 



large step size might be harmful for convergence. We deal with such issues both by using damping 
(i.e. setting a trust region around 0) and by properly selecting a learning rate. 



3 Natural Gradient for Neural Networks 

The natural gradient for neural networks relies on their probabilistic interpretation (which induces a 
similarity measure between different parametrization of the model) given in the form of conditional 
probabilities pg(t\x), with x representing the input and t the target. 

We make use of the following notation. q(x) describes the data generating distribution of x and 
<?(t|x) is the distribution we want to learn, y is the output of the model, and by an abuse of no- 
tation, it will refer to either the function mapping inputs to outputs, or the vector of output acti- 
vations, r is the output of the model before applying the output activation function a. v& and 
are the i-th target and input samples of the training set 

dyi dyi 



J y stands for the Jacobian matrix 
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Finally, lower indices such as yt denote the i-th element of a vector. 



We define the neural network loss as follows: 



C(0) 
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(6) 



Because we have a conditional density function pg(t\x) the formulation for the natural gradient 
changes slightly. Each value of x now defines a different family of density functions pg(t\x), and 
hence a different manifold. In order to measure the functional behaviour of p$(t\x) for different 
values of x, we use the expected value (with respect to x ~ of the KL-divergence between 

Pe(t|x) andp e+A e(t|x). 



arg miriAe C(0 + AO) 

s. t. E x ^ (x) [KL(p g (t\x)\\pe 



-Ae(t|x))] = constant 



(7) 



The metric G is now an expectation over g(x) of an expectation over p(t|x). The former aver- 
ages over possible manifolds generated by different choices of x, while the latter comes from the 
definition of the Fisher Information Matrix. 



Vjv£(#) 



OC{0) 
dO 



^0 



Et~ P (t|> 



31ogp e (t|x)\ f dlogp e (t\x) 



00 



dO 



dC(0) . 
~d0~^ 

(8) 



Note that we use the distribution q instead of the empirical q. This is done in order to emphesis that 
the theory does not force us to use the empirical distribution. However, in practice, we do want q to 
be as close as possible to q such that the curvature of the KL-divergence matches (in some sense) 
the curvature of the error surface. To clarify the effects of q let us consider an example. Assume 
that q is unbalanced with respect to q. Namely it contains twice the amount of elements of a class A 
versus the other B. This means that a change in that affects elements of class A is seen as having a 
larger impact on p (in the KL sense) than a change that affects the prediction of elements in B. Due 
to the formulation of natural gradient, we will move slower along any direction that affects A at the 
expense of B landing with higher probability on solutions 0* that favour predicting A. In practice 
we approximate this expectation over q by a sample average over minibatches. 
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In what follows we consider typical output activation functions and the metrics G they induce. In 
the Appendix we provide a detailed description of how these matrices were obtained starting from 
equation (|8j. Similar derivations were done in Park et al. (2000), which we repeat for convenience. 
The formulas we get for the linear, sigmoid and softmax activation functions are: 
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To efficiently implement the natural gradient, we use a truncated Newton approach following the 
same pipeline as Hessian-Free ( |Martens| |2010| l (more details are provided in the Appendix). We 



rely on Theano (Bergstra et al. 20101 for both flexibility and in order to use GPUs to speed up. 



The advantages of this pipeline are two-fold: (1) it uses the full-rank matrix, without the need for 
explicitely storing it in memory and (2) it does not rely on a smoothness assumption of the metric. 
Unlike other algorithms such as nonlinear conjugate gradient or BFGS, it does not assume that the 
curvature changes slowly as we change 9. This seems to be important for recurrent neural networks 
(as well as probably for deep models) where the curvature can change quickly (Pasc anu et al.\ \2Q\3). 



4 Insights into natural gradient 

Figure [T] considers a one hidden unit auto-encoder, where we minimize the error 
(x — w ■ sigmoid(ti;x + b) + b) 2 and shows the path taken by Newton method (blue), natural gradi- 
ent (gray), Le Roux's version of natural gradient (orange) and gradient descent (purple). On the left 
(larger plot) we show the error surface as a contour plot, where the x-axis represents b and y-axis is 
to. We consider two different starting points ([1] and [3]) and draw the first 100 steps taken by each 
algorithm towards a local minima. The length of every other step is depicted by a different shade 
of color. Hyper-parameters, like learning rate and damping constant, were chosen such to improve 
convergence speed while maintaining stability (i.e. we looked for a smooth path). Values are pro- 
vided in the Appendix. At every point 9 during optimization, natural gradient considers a different 
KL divergence surface, KL(pe \\pe+Ae) parametrized by A9, which has a minima at origin. On the 
right we have contour plots of four different KL surfaces. They correspond to locations indicated by 
black arrows on the path of natural gradient. The x-axis is Ab and y-axis is Aw for the KL surfaces 
subplots. On top of these contour plots we show the direction and length of the steps proposed by 
each of the four considered algorithms. 

The point of this plot is to illustrate that each algorithm can take a different path in the parame- 
ter space towards local minima. In a regime where we have a non-convex problem, with limited 
resources, these path can result in qualitatively different kinds of minima. We can not draw any 
general conclusions about what kind of minima each algorithm finds based on this toy example, 
however we make two observations. First, as showed on the KL-surface plot for [3] the step taken 
by natural gradient can be smaller than gradient descent (i.e. the KL curvature is high) even though 
the error surface curvature is not high (i.e. Newton's method step is larger than gradient descent 
step). Secondly, the direction chosen by natural gradient can be quite different from that of gradient 
descent (see for example in [3] and [4]), which can result in finding a different local minima than 
gradient descent (for e.g. when the model starts at [3]). 



4.1 A comparison between Amari's and Le Roux's natural gradient 



In Le Roux et al. ( 2007 1 a different approach is taken to derive natural gradient. Specifically one 
assumes that the gradients computed over different minibatches are distributed according to a Gaus- 
sian centered around the true gradient with some covariance matrix C. By using the uncertainty 
provided by C we can correct the step that we are taking to maximize the probability of a downward 
move in generalization error (expected negative log-likelihood), resulting in a formula similar to that 
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Error surface 




KL divergence surface at [1] KL divergence surface at [3] 




KL divergence surface at [21 KLdiverqencesurface at [4] 




Figure 1 : Path taken by four different learning algorithms towards a local minima. Newton method 
(blue), natural gradient (gray), Le Roux's natural gradient (orange) and gradient descent (purple). 
See text for details. 



of natural gradient. If g 
direction g = dC *P C -1 where C is: 



^ is the gradient, then Le Roux et al. ( 2007 1 proposes following the 



c = i£, (g-(g» T (g-(g» 



(12) 



While the probabilistic derivation requires the use of the centered covariance, equation (12), in 
Le Roux et al. ( 2007| > it is argued that using the uncentered covariance U is equivalent up to a 
constant resulting in a simplified formula which is sometimes confused with the metric derived by 
Amari. 
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The misunderstanding comes from the fact that the equation has the form of an expectation, though 
the expectation is over the empirical distribution t7(x, t). It is therefore not clear if U tells us how 
pe would change, whereas it is clear that G does. The two methods are just different, and one can 
not straightforwardly borrow the interpretation of one for the other. However, we believe that there 
is an argument strongly suggesting that the protection against drops in generalization error afforded 
by Le Roux's U is also a property shared by the natural gradient's G. 

If KL(p || q) is small, than U can be seen as an approximation to G. Specifically we approximate 
the second expectation from equation ([8]), i.e. the expectation over t ~ pg (t |x), by a single point, the 
corresponding t^. This approximation makes sense when t'*' is a highly probable sample under p 
which happens when we converge. Note that at convergence, U, G and the Hessian are very similar, 
hence both versions of natural gradient and most second order methods would behave similarly. 

An interesting question is if these different paths taken by each algorithm represent qualitatively dif- 
ferent kinds of solutions. We will address this question indirectly by enumerating what implications 
each choice has. 

The first observation has to do with numerical stability. One can express G as a sum of n x o outer 
products (where n is the size of the minibatch over which we estimate the matrix and o is the number 
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of output units) while U is a sum of only n outer products. Since the number of terms in these sums 
provides an upper bound on the rank of each matrix, it follows that one could expect that U will be 
lower rank than G for the same size of the minibatch n. This is also pointed out by |Schraudolph| 
(2002) to motivate the extended Gauss-Newton matrix as middle ground between natural gradient 
and the true HessiarfO 

A second difference regards plateaus of the error surface. Given the formulation of our error function 
in equation |6} (which sums the log of pg(t\x) for specific values of t and x), flat regions of the 
objective function are intrinsically flat regions of the functional manifolc)^] Moving at constant speed 



in the functional space means we should not get stalled near such plateaus. In |Park etal. ( 2000] ) such 



plateaus are found near singularities of the functional manifold, providing a nice framework to study 



them (as is done for example in Rattray et al. ( 1998 1 where they hypothesize that such singularities 



behave like repellors for the dynamics of natural gradient descent). An argument can also be made 
in favour of U at plateaus. If a plateau at 9 exists for most possible inputs x, than the covariance 
matrix will have a small norm (because the vectors in each outer product will be small in value). The 
inverse of U consequentially will be large, meaning that we will take a large step, possibly out of 
the plateau region. This suggest both methods should be able to escape from some plateaus, though 
the reasoning behind the functional manifold approach more clearly motivates this advantage. 

Another observation that is usually made regarding the functional manifold interpretation is that it 
is parametrization-independent. That means that regardless of how we parametrize our model we 
should move at the same speed, property assured by the constraint on the KL-divergence between pg 



and pe+Ae- In Sohl-Dickstein (20121), following this idea, a link is made between natural gradient 



and whitening in parameter space. This property does not transfer directly to the covariance matrix. 

On the other hand Le Roux's method is designed to obtain better generalization errors by moving 
mostly in the directions agreed upon by the gradients on most examples. We will argue that the 
functional manifold approach can also provide a similar property. 

One argument relies on large detrimental changes of the expected log-likelihood, which is what Le 
Roux's natural gradient step protects us from with higher probability. The metric of Amari's natural 
gradient measures the expected (over x) KL-divergence curvature. We argue that if AO induces a 
large change in log-likelihood computed over x g D, where D corresponds to some minibatch, then 
it produces a large change in pg (in the KL sense), i.e. it results in a high KL-curvature. Because 
we move at constant speed on the manifold, we slow down in these high KL-curvature regions, 
and hence we do not allow large detrimental changes to happen. This intuition becomes even more 
suitable when D is larger than the training set, for example by incorporating unlabeled data, and 
hence providing a more accurate measure of how pg changes (in the KL sense). This increase in 
accuracy should allow for better predictions of large changes in the generalization error as opposed 
to only the training error. 

A second argument comes from looking at the Fisher Information matrix 
which has the form of an uncentered weighted covariance matrix of gradients, 



d log pa (t|x] 



de 



Note that these are not the gradients ^ 



that we follow towards a local minima. By using this matrix natural gradient moves in the expected 
direction of low variance for pg. As the cost £ just evaluates pg at certain points tW for a given 
xS l \ we argue that with high probability expected directions of low variance for pg correspond to 
directions of low variance for C. Note that directions of high variance for C indicates direction in 
which pg changes quickly which should be reflected in large changes of the KL. Therefore, in the 
same sense as TONGA, natural gradient avoids directions of high variance that can lead to drops in 
generalization error. 



■ 1 1 i 

Note that Schraudolph (2002 ) assumes a form of natural gradient that uses U as a metric, similar to the 



Le Roux et al.\ j2007| proposed, an assumption made in 



Martens 



{2010} as well. 



4 One could argue that p might be such that C has a low curvature while the curvature of KL is much larger. 
This would happen for example if p is not very sensitive to 9 for the values x^ , t'*' provided in the training 
set, but it is for other pairings of x and t. However we believe that in such a scenario is more useful to move 
slowly, as the other parings of x and t might be relevant for the generalization error 
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4.2 Natural gradient descent versus second order methods 

Even though natural gradient is usually assumed to be a second order method it is more useful, and 
arguably more correct to think of it as a first order method. While it makes use of curvature, it is 
the curvature of the functional manifold and not that of the error function we are trying to minimize. 
The two quantities are different. For example the manifold curvature matrix is positive semi-definite 
by construction while for the Hessian we can have negative curvature. 

To make this distinction clear we can try to see what information carries the metric that we invert 
(as it was done in |Roux and Fitzgibbon| ( pOTO| l for Newton's and Le Roux's methods). 

The functional manifold metric can be written as either the expectation of the Hessian or the 

T 



expectation of the Fisher Information Matrix 



<>l>n 
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(see Q and Q). The first form tells us 



the that the matrix measures how a change in 9 affects the gradients of pg (as the Hessian would 

do for the error). The second form tells us how the change in the input affects the gradients ^i-, as 
the covariance matrix would do for Le Roux's TONGA. However, while the matrix measures both 
the effects of a change in the input and 9 it does so on the functional behaviour of pg who acts as a 
surrogate for the training error. As a consequence we need to look for density functions pg which 
are correlated with the training error, as we do in the examples discussed here. 

Lastly, compared to second order methods, natural gradient lends itself very well to the online 
optimization regime. In principle, in order to apply natural gradient we need an estimate of the 
gradient, which can be the stochastic gradient over a single sample and some reliable measure of 
how our model, through pg, changes with 9 (in the KL sense), which is given by the metric. For 
e.g. in case of probabilistic models like DBMs, the metric relies only on negative samples obtained 
from pg and does not depend on the empirical distribution q at al|Desjardins et aE] ( |2013) l, while 
for a second order method the Hessian would depend on q. For conditional distributions (as is the 
case for neural networks), one good choice is to compute the metric on a held out subset of input 
samples, offering this way an unbiased estimate of how p(t|x) changes with 9. This can easily be 
done in an online regime. Given that we do not even need to have targets for the data over which we 
compute the metric, as G integrates out the random variable t, we could even use unlabeled data to 
improve the accuracy as long as it comes from the same distribution q, which can not be done for 
second order methods. 



5 Natural gradient robustness to overfitting 



We explore the robustness hypothesis from section |4T| empirically. The results of all experiments 
carried out are summarized in table [T] present in the Appendix. Firstly we consider the effects of 
using extra unlabeled data to improve the accuracy of the metric. A similar idea was proposed 
in Sun et al. (2009). The idea is that for G to do a good job in this robustness sense, it has to 



accurately predict the change in KL divergence in every direction. If G is estimated from too little 
data (e.g., a small labeled set) and that data happens to be the training set, then it might "overfit" and 
underestimate the effect of a change in some directions where the training data would tend to push 
us. To protect us against this, what we propose here is the use a large unlabeled set to obtain a more 
generalization-friendly metric G. 

Figure[2]describes the results on the Toronto Face Dataset (TFD), where using unlabeled data results 
in 83.04% accuracy vs 81 . 13% without. State of the art is 85 % Ri fai et al. | ( |20 12) , though this result is 
obtained by a larger model that is pre-trained. Hyper-parameters were validated using a grid-search 
(more details in the Appendix). 

As you can see from the plot, it suggests that using unlabeled data helps to obtain better testing error, 



as predicted by our argument in Sec. 4.1 This comes at a price. Convergence (on the training error) 



is slower than when we use the same training batch. 

Additionally we explore the effect of using different batches of training data to compute the metric. 
The full results as well as experimental setup are provided in the Appendix. It shows that, as most 
second order methods, natural gradient has a tendency to overfit the current minibatch if both the 
metric and the gradient are computed on it. However, as suggested in |Vinyals and Povey (20121 



using different minibatches for the metric helps as we tend not to ignore directions relevant for other 
minibatches. 
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Figure 2: (left) train error (cross entropy over the entire training set) on a log scale and (right) test 
error (percentage of misclassified examples) as a function of number of updates for the Toronto 
Faces Dataset. 'kl, unlabeled' stands for the functional manifold version of natural gradient, where 
the metric is computed over unlabeled data, for 'KL, different training minibatch' we compute 
the metric on a different minibatch from the training set, while 'KL, same minibatch' we compute 
the metric over the same minibatch we computed the gradient hence matching the standard use of 
hessian-free. 'covariance' stands for tonga that uses the covariance matrix as a metric, while msgd 
is minibatch stochastic gradient descent, note that the x axis was interrupted, in order to improve the 
visibility of how the natural gradient methods behave. 



— » NGD validation error 49.8% 
--* MSGD validation error 49.8% 



Fraction of the dataset replaced 

Figure 3: The plot describes how much the model is influenced by different parts of an online 
training set, for the two learning strategies compared (minibatch stochastic gradient descent and 
natural gradient descent). The x-axis indicates which part (1st 10th, 2nd 10th, etc.) of the first half 
of the data was randomly resampled, while the y-axis measures the resulting variance of the output 
due to the change in training data. 



6 Natural gradient is robust to the order of the training set 

We explore the regularization effects of natural gradient descent by looking at the variance of the 
trained model as a function of training samples that it sees. To achieve this we repeat the experiment 
described in Erhan et aT| ( |2010} which looks at how resampling different fraction of the training set 
affects the variance of the model and focuses specifically to the relative higher variance of the early 
examples. Our intuition is that by forbidding large jumps in the KL divergence of pg and following 
the direction of low variance natural gradient will try to limit the amount of overfitting that occurs 
at any stage of learning. 



We repeat the experiment from Erhan et al. ( 2010[ ), using the NISTP dataset introduced in Bengio 
et al. ( |2011[ l (which is just the NIST dataset plus deformations) and use 32. 7M samples of this data. 
We divide the first 16. 3M data into 10 equal size segments. For each data point in the figure, we fix 
9 of the 10 data segments, and over 5 different runs we replace the 10th with 5 different random sets 
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of samples. This is repeated for each of the 10 segments to produce the down curves. By looking 
at the variance of the model outputs on a held out dataset (of 100K samples) after the whole 32. 7M 
online training samples, we visualize the influence of each of the 10 segments on the function learnt 
(i.e., at the end of online training). The curves can be seen in figure[3] 

There are two observation to be made regarding this plot. Firstly, it seems that early examples have a 
relative larger effect on the behaviour of the function than latter ones (phenomena sometimes called 
early-overfitting). This happens for both methods, natural gradient and stochastic gradient descent. 
The second observation regards the overall variance of the learnt model. 

Note that the variance at each point on the curve depends on the speed with which we move in 
functional space. For a fixed number of examples one can artificially tweak the curves for e.g. by 
decreasing the learning rate. With a smaller learning rate we move slower, and since the model, 
from a functional point of view, does not change by much, the variance is lower. In the limit, with 
a learning rate of 0, the model always stays the same. If we increase the number of steps we take 
(i.e. measure the variance after k times more samples) the curve recovers some of its shape. This is 
because we allow the model to move further away from the starting point. 

In order to be fair to the two algorithms, we use the validation error as a measure of how much we 
moved in the functional space. This helps us to chose hyper-parameters such that after 32. 7M sam- 
ples both methods achieve the same validation error of 49.8% (see Appendix for hyper-parameters). 

The results are consistent with our hypothesis that natural gradient avoids making large steps in 
function space during training, staying on the path that induces least variance. Such large steps 
may be present with SGD, possibly yielding the model to overfit (e.g. getting forced into some 
quadrant of parameter space based only on a few examples) resulting in different models at the end. 
By reducing the variance overall the natural gradient becomes more invariant to the order in which 
examples are presented. Note that the relative variance of early examples to the last re-sampled 
fraction is about the same for both natural gradient and stochastic gradient descent. However, the 
amount of variance induced in the learnt model by the early examples for natural gradient is on the 
same magnitude as the variance induce by the last fraction of examples for MSGD (i.e. in a global 
sense natural gradient is less sensitive the order of samples it sees). 



7 The relationship between Hessian-Free and natural gradient 



Hessian-Free as well as Krylov Subspace Descent rely on the extended Gauss-Newton approxima- 
tion of the Hessian, Gn instead of the actual Hessian (see Schraudolph (2002 1). 



G N — 



n ± — ' 



dry d 2 logp(tW|x«) 
dr 2 



= E x ^~ [J^ (E w(t , x) [H £or ]) J r ] (14) 



The reason is not computational, as computing both can be done equally fast, but rather better 
behaviour during learning. This is usually assumed to be caused by the fact that the Gauss-Newton 
is positive semi-definite by construction, so one needs not worry about negative curvature issues. 

In this section we show that in fact the extended Gauss-Newton approximation matches perfectly 
the natural gradient metric, and hence by choosing this specific approximation, one can view both 
algorithms as being implementations of natural gradient rather than typical second order methods. 

The last step of equation ( fl4| i is obtained by using the normal assumption that (x'*),tW) are i.i.d 
samples. We will consider the three activation functions and corresponding errors for which the 
extended Gauss-Newton is defined and show it matches perfectly the natural gradient metric for the 
same activation. 

For the linear output units with square errors we can derive the matrix H£ Dr as follows: 



H£o rii 



d 2 E fc (r fc -t fc ) 2 _ 32(r i -t i ) = q 
dridrj drj 

dridri dvi ~ 



(15) 
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G N — - ^2 J^H£ or J r — - ^2 JyH^oyJy — Jy (21) Jy = 2E x£(; ( x ) [Jyj y ] 

B x«,t(') n x<').t<«) " X W 

(16) 

The result is summarized in equation [16} where we make use of the fact that r = y. It matches the 
corresponding natural gradient metric, equation ( |24| > from section|3] up to a constant. 

In the case of sigmoid units with cross-entropy objective (a is the sigmoid function), H£ or is 



9 2 T. k (-tk log(<r(r fc ))-(l-t fc )log(l-<T(r fc ))) 
dri drj 

a(-t,^- T ^(r I )(l- 1 T(n)) + (l-t,) Tr ^- Tg (r,)(l- CT (r a ))) = g^-U =Q (17) 

If we insert this back into the Gauss-Newton approximation of the Hessian and re-write the equation 
in terms of J y instead of J r , we get, again, the corresponding natural gradient metric, equation ( fTO) . 



H 



Con 



H 



CoTi 



Gn = i Exci),tw J?H £or J r = i Exw J ? dia § (y(i - y)) dia s (yo^y) dia § (y(i - y)) J r 



= E 



Jydiag 



y(i-y! 



(18) 



The last matching activation and error function that we consider is the softmax with cross-entropy. 



H £oriI = ... = = 0(r 4 ) - ^(ri)0(r,) 



(19) 



Equation ( |20| i starts from the natural gradient metric and singles out a matrix M in the formula such 
that the metric can be re-written as the product J^MJ r (similar to the formula for the Gauss-Newton 
approximation). In pTj ) we show that indeed M equals H£ or and hence the natural gradient metric 
is the same as the extended Gauss-Newton matrix for this case as well. Note that 6 is the Kronecker 
delta, where Sijj^j — and 5a = 1. 



G = E 



= E„ 



V° J_ ( dyk_\ T ( d Vk \ 

Z^fc=l y k y dr J \ Or ) 



(20) 



M w¥j = ELi y\wfwj= ELi ( s ki - Vi)Vk (5 k j -Vj) = ViVj - ViVj - ViVj = -<f>(n)(t>(rj) 

M,, = ELi tMM=£ ELi w) + Vi ~ H = <Kn) - Hn)Hn) 

(21) 

8 Conclusion 

In this paper we re-derive natural gradient, by imposing that at each step we follow the direction 
that minimizes the error function while resulting in a constant change in the KL-divergence of the 
probability density function that represents the model. This approach minimizes the amount of 
differential geometry needed, making the algorithm more accessible. 

We show that natural gradient, as proposed by Amari, is not the same as the algorithm proposed by 
Le Roux et al, even though it has the same name. We highlight a few differences of each algorithm 
and hypothesis that Amari's natural gradient should exhibit the same robustness against overfitting 
that Le Roux's algorithm has, but for different reasons. 



10 



We explore empirically this robustness hypothesis, by proving better test errors when unlabeled data 
is used to improve the accuracy of the metric. We also show that natural gradient may reduce the 
worrisome early specialization effect previously observed with online stochastic gradient descent 
applied to deep neural nets, and reducing the variance of the resulting learnt function (with respect 
to the sampled training data). 

By computing the specific metrics needed for standard output activation functions we showed that 
the extended Gauss-Newton approximation of the Hessian coincides with the natural gradient metric 
(provided that the metric is estimated over the same batch of data as the gradient). Given this identity 
one can re-interpret the recently proposed Hessian-Free and Krylov Subspace Descent as natural 
gradient. 

Finally we point out a few differences between typical second order methods and natural gradient. 
The latter seems more suitable for online or probabilistic models, and relies on a surrogate probabil- 
ity density function pg in place of the error function in case of deterministic models. 
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Appendix 

8.1 Expected Hessian to Fisher Information Matrix 

The Fisher Information Matrix form can be obtain from the expected value of the Hessian 
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8.2 Derivation of the natural gradient metrics 
8.2.1 Linear activation function 

In the case of linear outputs we assume that each entry of the vector t, ti comes from a Gaussian 
distribution centered around yi(x) with some standard deviation /?. From this it follows that: 
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8.2.2 Sigmoid activation function 
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In the case of the sigmoid units, i,e, y = sigmoid(r), we assume a binomial distribution which gives 
us: 



P (t|x) = ny**(i-yi) 1 - 



(25) 



logp gives us the usual cross-entropy error used with sigmoid units. We can compute the Fisher 
information matrix as follows: 
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8.2.3 Softmax activation function 

For the softmax activation function, y = softmax(r), p(t|x) takes the form of a multinomial: 



P (t|x) = ny- 1 



(27) 
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8.3 Implementation Details 



We have implemented natural gradient descent using a truncated Newton approach similar to the 
pipeline proposed by Pearlmutter ( 1994) and used by Mar tens| ( |2010| l. In order to better deal with 
singular and ill-conditioned matrices we use the MinRes-QLP algorithm (Choi et dl. 201 1 1 instead 
of linear conjugate gradient. Both Minres-QLP as well as linear conjugate gradient can be found im- 
plemented in Theano at https://github.com/pascanur/theano_optimize We used the Theano library 



(Bergstra et al. 2010) which allows for a flexible implementation of the pipeline, that can automat- 
ically generate the computational graph of the metric times some vector for different models: 
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import theano . tensor as TT 

# 'params ' is the list of Theano variables containing the parameters 

# 'vs' is the list of Theano variable representing the vector 'v 1 

# with whom we want to multiply the metric 

# 'Gvs' is the list of Theano expressions representing the product 

# between the metric and 'vs' 

# , out_smx' 1 is the output of the model with softmax units 

Gvs = TT . Lop (out_smx, params, 

TT . Rop (out_smx, params , vs ) / (out_smx*out_smx . shape [ ] ) ) 

# , out_sig' 1 is the output of the model with sigmoid units 

Gvs = TT . Lop (out_sig, params, 

TT . Rop (out_sig, params , vs ) / (out_sig* 

( l-out_sig) * 
out_sig . shape [ ] ) ) 

# 'out * is the output of the model with linear units 

Gvs = TT . Lop (out , params , TT . Rop (out , params , vs ) /out . shape [ ] ) 

The full pseudo-code of the algorithm (which is very similar to the one for Hessian-Free) is given 
below. The full Theano implementation can be retrieved from https://github.com/pascanur/natgrad 



Algorithm 1 Pseudocode for natural gradient algorithm 

# 'gfn' is a function that computes the metric times some vector 

gfn <— (lambda v —> Gv) 

while not early _stopping_condition do 

& ^ de 

# linear _cg solves the linear system Gx = 
ng<— linear _cg(gfn, g, max Jters = 20, rtol=le-4) 

# 7 is the learning rate 
9 <- 9 - 7 ng 

end while 



Even though we are ensured that G is positive semi-definite by construction, and MinRes-QLP is 
able to find a suitable solutions in case of singular matrices, we still use a damping strategy for two 
reasons. The first one is that we want to take in consideration the inaccuracy of the metric (which 
is approximated only over a small minibatch). The second reason is that natural gradient makes 
sense only in the vicinity of 9 as it is obtained by using a Taylor series approximation, hence (as 
for ordinary second order methods) it is appropriate to enforce a trust region for the gradient. See 
Schaul (20121, where the convergence properties of natural gradient (in a specific case) are studied. 



Following the functional manifold interpretation of the algorithm, we can recover the Levenberg- 
Marquardt heuristic used in Martens ( 2010) by considering a first order Taylor approximation, where 
for any function /, 



This gives as the reduction ratio given by equation ( |30| ) which can be shown to behave identically 
with the one in |Martens1 ( |2010| l. 

f(e t -r lG -^ T )-f(9 t ) 
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8.4 Additional experimental results 

For the one hidden unit auto-encoder we selected hyper-parameters such to ensure stability of train- 
ing, while converging as fast as possible to a minima. We compute the inverse of the metric or 
Hessian exactly (as it is just a 2 by 2 matrix). The learning rate for SGD is set to .1, for Amari's 
natural gradient .5 and for the covariance of gradience 1. (Newton's method usually does not use 
a learning rate). We damped the Hessian and the covariance of gradients by adding I and Amari's 
metric using 0.01 ■ I. 

8.5 Restricted MNIST experiment 

For the restricted MNIST, we train a one hidden layer MLP of 1500 hidden units. The hyper- 
parameters where chosen based on a grid search over learning rate, damping factor and damping 
strategy. Note that beside using unlabeled data, the regularization effect of natural gradient is 
strongly connected to the damping factor which accounts for the uncertainty in the metric (in a 
similar way to how it does in the uncentered covariance version of natural gradient). The minibatch 
size was kept constant to 2500 samples for natural gradient methods and 250 for MSGD. We used a 
constant learning rate and used a budged of 2000 iterations for natural gradient and 40000 iterations 
for MSGD. 

We used a learning rate of 1.0 for MSGD and 5.0 for the functional manifold NGD using unlabeled 
data or the covariance based natural gradient. For the functional manifold NGD using either the 
same training minibatch or a different batch from the training set for computing the metric we set 
the learning rate to 0.1. We use a Levenberg-Marquardt heuristic only when using unlabeled data, 
otherwise the damping factor was kept constant. Its initial value was 2.0 for when using unlabeled 
data, and 0.01 for every case except when using the covariance of the gradients as the metric, when 
is set to 0. 1 . 




45 50 TOO 150 200 1950 2000 2050 2100 2150 2200 2250 23& 5 

number updates/10 number updates/10 



Figure 4: (left) train error (cross entropy over the entire training set) on a log scale in order to im- 
prove visibility and (right) test error (percentage of misclassified examples) as a function of number 
of updates for the restricted mnist dataset. 



8.6 MNIST experiment 

The model used has 3 layers, where the first two are convolutional layers both with filters of size 
5x5. We used 32 filters on the first layer and 64 on the second. The last layer forms an MLP with 750 
hidden units. We used minibatches of 10000 examples (for both the gradient and the metric), and a 
j decaying learning rate strategy. The learning rate was kept constant for the first 200 updates and 
then it was computed based on the formula - — f° 200 , where t is the number of the current update. 
We used a budged ot 2000 update. 

The learning rate was set to 0.5 for the functional manifold approach when using a different batch 
for computing the metric and 1.0 when using the same batch for computing the metric, or for using 
the covariance of gradients as metric. We use a Levenberg-Marquardt heuristic to adapt the damping 
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Table 1: Results on the three datasets considered (restricted MNIST, MNIST and TFD). Note that 
different models are used for different datasets. The training error is given as cross-entropy error, 
while the test error is percentage of miss-classified examples. The algorithms name are the same as 
in the legend of figure [2] 



Data set 


Data fold 


MSGD 


KL, UNLABELED 


KL, DIFFERENT 
BATCH 


KL, SAME 
BATCH 


COVARIANCE 


Restricted 


train 


0.0523 


0.0017 


0.0012 


0.0023 


0.0006 


MNIST 


TEST 


5.22% 


4.63% 


4.89% 


4.91% 


4.74% 


MNIST 


TRAIN 






0.00010 


0.0011 


0.024 




TEST 






0.78% 


0.82% 


1.07% 


TFD 


TRAIN 




0.054 


0.098 








TEST 




16.96% 


18.87% 







factor which initially is 5.0 for the functional manifold approach, and a constant damping factor of 
0.1 for using the covariance as metric. These values were validated by a grid search. 



8.7 TFD experiment 

The Toronto Face Dataset (TFD), has a large amount of unlabeled data of poorer quality than the 
training set. To ensure that the noise in the unlabeled data does not affect the metric, we compute the 
metric over the training batch plus unlabeled samples. We used a three hidden layer model, where 
the first layer is a convolutional layer of 300 filters of size 12x12. The second two layers from a 2 
hidden layer MLP of 2048 and 1024 hidden units respectively. 

For the TFD experiment we used the same decaying learning rate strategy introduced above, in 
subsection [83] where we computed gradients over the minibatch of 960 examples. When using the 
unlabeled data, we added 480 unlabeled examples to the 960 used to compute the gradient (therefore 
the metric was computed over 1440 examples) otherwise we used the same 960 examples for the 
metric. In both cases we used an initial damping factor of 8, and the Levenberg-Marquardt heuristic 
to adapt this damping value. Initial learning rate l was set to 1 in both cases. 

Note that we get only 83.04% accuracy on this dataset, when the state of the art is 85.0% Rifai et al. 
( |2012| l, but our first layer is roughly 3 times smaller (300 filters versus 1024). 

8.8 NISTP exepriment (robustness to the order of training samples) 

The model we experimented with was an MLP of only 500 hidden units. We compute the gradients 
for both MSGD and natural gradient over minibatches of 512 examples. In case of natural gradient 
we compute the metric over the same input batch of 512 examples. Additionally we use a constant 
damping factor of 3 to account for the noise in the metric (and ill-conditioning since we only use 
batches of 5 12 samples). The learning rates were kept constant, and we use .2 for the natural gradient 
and. 1 for MSGD. 
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Figure 5: Train and test error (cross entropy) on a log scale as a function of number of updates for 
the MNIST dataset. The legend is similar to figure [2] 
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